Plant Seedlings Classification Project

by Michele Casalgrandi

Context

The Aarhus University Signal Processing group released a dataset of images of unique plants belonging to 12 species at several growth stages.

Goals

Data

Import needed libraries

Exploratory Data Analysis - EDA

Load images and labels

The dataset contains 4750 images of size 128x128 and 3 channels

Review a sample of the images

Images are stored as BGR; we will convert them to RGB so we can display them correctly with matplotlib's imshow().
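The BGR-to-RGB conversion amounts to reversing the channel axis. A minimal numpy sketch (equivalent to calling cv2.cvtColor with COLOR_BGR2RGB on each image):

```python
import numpy as np

def bgr_to_rgb(images):
    # Reverse the last (channel) axis: BGR -> RGB
    return images[..., ::-1]

# Example: a pure-blue BGR pixel ends up with blue in the last RGB channel
bgr = np.zeros((1, 1, 3), dtype=np.uint8)
bgr[0, 0] = [255, 0, 0]          # blue in BGR order
rgb = bgr_to_rgb(bgr)
print(rgb[0, 0])                 # [  0   0 255]
```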

Select random sample from the images

Pixel intensities histograms

The pixel intensities are skewed to the right, indicating the images tend to be dark.
The narrow shape of the distribution indicates they are low contrast.

Images with the ruler have a second peak due to the bright white areas of the ruler.
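A pixel-intensity histogram like the one described can be computed by pooling every pixel of every channel. A minimal sketch (the helper name and bin count are assumptions, not the notebook's code):

```python
import numpy as np

def intensity_histogram(images, bins=256):
    # Count pixel intensities 0..255 across all images and channels
    counts, _ = np.histogram(np.asarray(images).ravel(), bins=bins, range=(0, 256))
    return counts

# A uniformly dark toy image: all mass piles up at the low end
dark = np.full((4, 4, 3), 10, dtype=np.uint8)
counts = intensity_histogram(dark)
print(counts[10])   # 48 = 4 * 4 pixels * 3 channels
```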

Labels

The classes are imbalanced.
E.g. there are roughly three times as many 'Loose Silky-bent' images as 'Common wheat' images.

Data Preprocessing

Normalization

Gaussian Blur

Resize one image and test Gaussian blur.

Resize and blur the entire array of images

View the images for each category after resizing and blurring

One-hot encode labels

Print label for the first image (index = 0)
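One-hot encoding turns each integer class id into a 12-element indicator vector. A numpy-only sketch, equivalent to keras `to_categorical`:

```python
import numpy as np

def one_hot(labels, num_classes=12):
    # Integer class ids -> one row per label with a single 1.0
    out = np.zeros((len(labels), num_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out

y = one_hot(np.array([0, 3]), num_classes=12)
print(y[0][:4])   # [1. 0. 0. 0.]
```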

Split the data into train, validation and test sets.
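The three-way split can be done with two calls to sklearn's `train_test_split`. A minimal sketch with toy data; the 10% split fractions and random seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 64, 64, 3)
y = np.random.randint(0, 12, 100)

# First carve off the test set, then split the remainder into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.1, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # 81 9 10
```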

Reshape

The data can be used as is for modeling in keras so there is no need to reshape.

Build CNN model

We will use the 'Adam' optimizer and 'categorical_crossentropy' as the loss function by default.

Define Layers
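A small Keras CNN along these lines, compiled with Adam and categorical cross-entropy as stated above. The layer sizes, input shape, and depth are illustrative assumptions, not the notebook's exact architecture:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(12, activation="softmax"),   # one unit per plant species
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
print(model.output_shape)   # (None, 12)
```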

Model Evaluation

Accuracy for the validation set is low.

The model is overfit, as shown by the rise in validation loss after the initial drop.
Training for more epochs will not help.

Validation Set Confusion Matrix - Observations

Model Predictions - Test set

(This is listed as part of the assignment in the pdf version of the project)

Model improvements

Padding

We'll set padding to 'same' to see if keeping larger feature maps in the pipeline improves performance, and we'll also raise the number of epochs.
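The effect of `padding='same'` is that each convolution preserves the spatial dimensions instead of shrinking them. A small comparison sketch:

```python
import numpy as np
from tensorflow.keras import layers

x = np.zeros((1, 64, 64, 3), dtype="float32")
# 'same' pads the borders so output spatial size equals input size;
# 'valid' (the default) loses a border, shrinking 64x64 to 62x62 for a 3x3 kernel
same = layers.Conv2D(32, (3, 3), padding="same")(x)
valid = layers.Conv2D(32, (3, 3), padding="valid")(x)
print(tuple(same.shape), tuple(valid.shape))   # (1, 64, 64, 32) (1, 62, 62, 32)
```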

Model 2

Model 2 evaluation

The model's accuracy is lower than the first model's.

More epochs might yield slightly higher accuracy.

Next, we will simplify the model and see if that improves performance.

Model 3

Model 3 evaluation

Performance improved over the previous model but is lower than the original's.

Model 4

We will try increasing the number of units in the dense layer compared to the first model.

Model 4 Evaluation

This model got stuck.

Model 5

We will train using class weights (computed previously when splitting the dataset).
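Class weights can be computed with sklearn's `compute_class_weight`, which weights each class inversely to its frequency so the imbalanced classes noted in the EDA contribute equally to the loss. A toy sketch (the labels here are illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1])   # imbalanced toy labels: 3 of class 0, 1 of class 1
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight = dict(enumerate(weights))   # pass this dict to model.fit(class_weight=...)
print(class_weight)   # {0: 0.666..., 1: 2.0} -> rarer class weighted more heavily
```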

Model 5 Evaluation

This model got stuck.

Test Set Confusion Matrix and Performance

Performance of the initial model against the test data is close to its accuracy against the validation set.

The model generalizes well.

The confusion matrix is similar to that of the validation set predictions, with similar classification errors.
E.g. Black-grass is often predicted to be Loose Silky-bent
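A confusion matrix like the ones compared here can be built with sklearn; row i, column j counts samples of true class i predicted as class j, so off-diagonal entries are errors such as Black-grass predicted as Loose Silky-bent. A two-class toy sketch:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
cm = confusion_matrix(y_true, y_pred)
print(cm)   # [[1 1]
            #  [0 2]]  -> one class-0 sample was misclassified as class 1
```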

Conclusions and Takeaways

Possible improvements